
Our project investigates how early childhood household structure and individual confidence affect income later in life. To investigate this relationship, we focus on 11 variables: race, sex, total number of incarcerations, household net worth, father’s educational attainment, mother’s educational attainment, perceived safety at school, perceived chance of completing high school by 20, perceived chance of being in jail by 20, perceived chance of being a parent by 20, and perceived chance of graduating college by 30.

Before we proceed any further, let us look at the income variable and check whether it contains outliers. We first create binned copies of the four percent-chance variables.

# Bin each percent-chance variable into five 20-point brackets.
# include.lowest = TRUE keeps responses of exactly 0 in the first bracket;
# without it, cut() would turn them into NA.
brackets <- c(0, 20, 40, 60, 80, 100)
nlsy1$HSD20.copy <- cut(nlsy1$HSD20, brackets, include.lowest = TRUE)
nlsy1$College30.copy <- cut(nlsy1$College30, brackets, include.lowest = TRUE)
nlsy1$Parent20.copy <- cut(nlsy1$Parent20, brackets, include.lowest = TRUE)
nlsy1$Jail20.copy <- cut(nlsy1$Jail20, brackets, include.lowest = TRUE)
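
The binning step relies on how `cut()` treats interval boundaries. The following is a minimal sketch on made-up scores (not NLSY values) showing that, by default, a response of exactly 0 falls outside the first interval, while `include.lowest = TRUE` retains it:

```r
# Hypothetical confidence scores on a 0-100 scale (not taken from the NLSY data).
scores <- c(0, 10, 20, 55, 100, NA)
breaks <- c(0, 20, 40, 60, 80, 100)

# Default: intervals are (0,20], (20,40], ... so a score of exactly 0 becomes NA.
default_bins <- cut(scores, breaks)

# include.lowest = TRUE closes the first interval, [0,20], so 0 is retained.
closed_bins <- cut(scores, breaks, include.lowest = TRUE)

data.frame(scores, default_bins, closed_bins)
```

Since many respondents report a 0 percent chance for items such as ‘Jail20’, the choice of boundary handling can change how many observations survive the binning.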

Getting to know our two main variables.

#getting basic summary for income
summary(nlsy1$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      -5      -5   14000   28035   45000  235884

The maximum is very high compared with the third quartile.

Let us create a histogram followed by a box plot to identify any outliers.

There are outliers in the dataset, but let us see whether they exist across races.

A few values fall more than 1.5 times the interquartile range beyond the quartiles. We drop these values because they would skew the results. There are also negative values; according to the codebook, these mark respondents who either did not respond (very few) or were not interviewed. Let us set them to NA as well and replot the box plot.

Let us also remove the rows where income is NA: income is the variable we are testing against, so every observation needs a value for it.
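
The cleaning steps described above can be sketched as follows. This is an illustration on simulated incomes, not the report's actual code: the `-5` non-response code, the variable names, and the lognormal parameters are assumptions made for the example.

```r
set.seed(1)
# Simulated incomes; -5 stands in for a codebook non-response code (assumption).
income <- c(round(rlnorm(200, meanlog = 10, sdlog = 0.8)), -5, -5, 900000)

# Recode negative codes as missing before computing any statistics.
income[income < 0] <- NA

# Box-plot fences: 1.5 * IQR beyond the first and third quartiles.
q <- quantile(income, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[2] - q[1])
lower <- q[1] - fence
upper <- q[2] + fence

# Flag values outside the fences, then keep only observed, non-outlying incomes.
is_outlier <- !is.na(income) & (income < lower | income > upper)
income_clean <- income[!is.na(income) & !is_outlier]
summary(income_clean)
```

Whether flagged values should be dropped or merely inspected is a judgment call; dropping them, as this report does, narrows the range of incomes the later models can speak to.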

#getting total observations after removing income NA values.
kable(nlsy %>% summarise("Total Observations" = n()), align="c")

| Total Observations |
|:------------------:|
| 4970 |

#getting distribution based on sex and race
kable(nlsy %>% group_by(race ,sex) %>% summarise("Sex Distribution" = n()))
## `summarise()` regrouping output by 'race' (override with `.groups` argument)

| race | sex | Sex Distribution |
|:---|:---|---:|
| Non-Black/Non-Hispanic | Female | 1241 |
| Non-Black/Non-Hispanic | Male | 1397 |
| Black | Female | 664 |
| Black | Male | 576 |
| Hispanic | Female | 513 |
| Hispanic | Male | 532 |
| Mixed Race(Non-Hispanic) | Female | 22 |
| Mixed Race(Non-Hispanic) | Male | 25 |

kable(nlsy %>% group_by(race) %>% summarise("Race Distribution" = n()))
## `summarise()` ungrouping output (override with `.groups` argument)

| race | Race Distribution |
|:---|---:|
| Non-Black/Non-Hispanic | 2638 |
| Black | 1240 |
| Hispanic | 1045 |
| Mixed Race(Non-Hispanic) | 47 |

Let us check the distribution of income across both sex and race.

We can see that average income differs across races. Next we check whether those differences are significant. After that we will check whether other factors affect average income.

There seems to be a significant difference in average income between all races except Mixed Race. The sample for that group is very small, so we neglect it for the purposes of this analysis.
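
One way to check which group means differ is an overall F-test followed by pairwise comparisons with a multiple-testing correction. The sketch below uses simulated groups (group labels, means, and sizes are assumptions, with one deliberately tiny group) and is not the test the report itself ran:

```r
set.seed(5)
# Simulated group incomes (assumption): three large groups, one very small group.
grp <- factor(rep(c("G1", "G2", "G3", "G4"), times = c(300, 150, 120, 8)))
means <- c(G1 = 50000, G2 = 38000, G3 = 43000, G4 = 49000)
income <- rnorm(length(grp), mean = means[as.character(grp)], sd = 25000)

# Overall test that at least one group mean differs.
overall <- anova(lm(income ~ grp))

# Pairwise Welch comparisons with Holm correction for multiple testing.
pairwise <- pairwise.t.test(income, grp, pool.sd = FALSE, p.adjust.method = "holm")
pairwise$p.value
```

With only a handful of observations in the smallest group, its comparisons have very little power, which mirrors the reason for setting the Mixed Race group aside.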

To understand which factors may drive these differences in income across races, let us look at other variables as well. For the purposes of this analysis we use the following variables:

  1. race (black and non-black): race : Identified race of the individual.

  2. sex: sex : Identified sex of the individual

  3. total # of incarcerations : TotalIncarcerations : Total number of separate incarcerations reported by the individual.

  4. FEEL SAFE AT SCHOOL AGREE/DISAGREE : FeelSafe : ‘Strongly Agree’, ‘Agree’, ‘Disagree’, ‘Strongly Disagree’

  5. PERCENT CHANCE R HAS HIGH SCHOOL DIPLOMA BY 20 YEARS OLD: HSD20 : Self-reported percent chance that individual believes they will have received a high school diploma by age 20.

  6. PERCENT CHANCE R IN JAIL BY 20 YEARS OLD: Jail20 : Self-reported percent chance that individual believes they will be in/have been in jail by age 20.

  7. PERCENT CHANCE R A PARENT BY 20 YEARS OLD: Parent20 : Self-reported percent chance that individual believes they will be a parent by age 20.

  8. PERCENT CHANCE R HAS COLLEGE DEGREE BY 30 YEARS OLD: College30 : Self-reported percent chance that individual believes they will have received a college degree by age 30.

  9. NET WORTH OF HOUSEHOLD ACCORDING TO PARENT: ParentNetworth : Net worth of household as reported by the individual’s parent.

  10. BIOLOGICAL FATHERS HIGHEST GRADE COMPLETED: HGC_BIO_DAD : Highest grade completed by the individual’s biological father.

  11. BIOLOGICAL MOTHERS HIGHEST GRADE COMPLETED: HGC_BIO_MOM : Highest grade completed by the individual’s biological mother.

Let us now look at each categorical variable and see whether it has any impact on income.

Average income based on parents’ education:

For both parents, education seems to be positively correlated with the child’s income, although average income looks unusually high for parents whose highest grade completed is just 1 year. Let us run a t-test to check whether there is a significant difference when education increases by 1 year.

t.test(nlsy$income,nlsy$HGC_BIO_MOM1,na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  nlsy$income and nlsy$HGC_BIO_MOM1
## t = 110.05, df = 4969, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44125.85 45726.44
## sample estimates:
##   mean of x   mean of y 
## 44938.74366    12.59662
t.test(nlsy$income,nlsy$HGC_BIO_DAD1,na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  nlsy$income and nlsy$HGC_BIO_DAD1
## t = 110.05, df = 4969, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44125.79 45726.39
## sample estimates:
##   mean of x   mean of y 
## 44938.74366    12.65296

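Both tests reject equality of means, for both parents. Note, however, that each test compares mean income (about $44,939) with mean years of schooling (about 12.6), two quantities on different scales, so the rejection does not by itself show that income rises with parents’ education. The regression coefficients on parents’ education reported later address that question directly.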
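
A two-sample t-test between income and years of schooling compares two means measured in different units. To check whether income moves with parents’ education, a correlation test or a simple regression is a more direct tool. A minimal sketch on simulated data follows; the variable names and the data-generating numbers are assumptions, not NLSY values:

```r
set.seed(42)
n <- 500
# Simulated data (assumption): income rises by roughly 700 per year of schooling.
parent_years <- sample(8:18, n, replace = TRUE)
income <- 30000 + 700 * parent_years + rnorm(n, sd = 20000)

# Two-sample t-test of income vs. years: compares two means on different scales,
# so it rejects whether or not the two variables are related.
scale_test <- t.test(income, parent_years)

# Tests that do address association:
assoc_test <- cor.test(income, parent_years)   # Pearson correlation
slope_fit  <- lm(income ~ parent_years)        # change in income per extra year

scale_test$p.value
assoc_test$estimate
coef(summary(slope_fit))["parent_years", ]
```

The slope from `lm()` reads directly as the change in income associated with one more year of a parent’s schooling, which is the quantity of interest here.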

In our final regression, we treat both parents’ education as a continuous variable rather than as a categorical variable. Given the coefficient values on both of these variables, it is easier to interpret the effect as a one year increase of parents’ education on the child’s future income.

Next we can investigate the effects of the educational variables, ‘HSD20’ and ‘College30’.

t.test(nlsy$income, nlsy$HSD20, na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  nlsy$income and nlsy$HSD20
## t = 109.85, df = 4969, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44043.75 45644.34
## sample estimates:
##  mean of x  mean of y 
## 44938.7437    94.6991
t.test(nlsy$income, nlsy$College30, na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  nlsy$income and nlsy$College30
## t = 109.9, df = 4969, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44063.65 45664.25
## sample estimates:
##   mean of x   mean of y 
## 44938.74366    74.79396

We can test ‘Jail20’ and ‘Parent20’ similarly:

t.test(nlsy$income, nlsy$Jail20, na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  nlsy$income and nlsy$Jail20
## t = 110.07, df = 4969, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44133.35 45733.95
## sample estimates:
##    mean of x    mean of y 
## 44938.743662     5.094609
t.test(nlsy$income, nlsy$Parent20, na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  nlsy$income and nlsy$Parent20
## t = 110.04, df = 4969, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44121.29 45721.88
## sample estimates:
##   mean of x   mean of y 
## 44938.74366    17.15815

Similar to ‘HSD20’ and ‘College30’, these two variables are reported as percentages. When such a variable is binned into confidence categories, its coefficient estimates signal the change in income relative to a change in confidence category. This is easier to interpret than the effect of a 1% change in reported confidence, which is how the coefficient would have to be read if the variable were not categorized.

Let us finally look at the last categorical variable and see whether it has any visible impact on the respondent’s income.

The highest proportion of respondents who feel unsafe is among Black respondents, while Non-Black/Non-Hispanic respondents feel the safest.

Getting the distribution of incarcerations by race.

kable(nlsy %>% group_by(race) %>% filter(TotalIncarcerations!=0) %>% 
    summarise("Total Incarcerated" = n(),"Total Incarcerations" = sum(TotalIncarcerations),
              "Avg Incarcerations" = round(mean(TotalIncarcerations,na.rm = TRUE),2)))
## `summarise()` ungrouping output (override with `.groups` argument)

| race | Total Incarcerated | Total Incarcerations | Avg Incarcerations |
|:---|---:|---:|---:|
| Non-Black/Non-Hispanic | 191 | 332 | 1.74 |
| Black | 115 | 200 | 1.74 |
| Hispanic | 88 | 149 | 1.69 |
| Mixed Race(Non-Hispanic) | 6 | 6 | 1.00 |

The largest number of incarcerated respondents are Non-Black/Non-Hispanic, followed by Black and Hispanic. Average incarcerations are essentially the same for all race groups except Mixed Race, which has only 6 incarcerated respondents. It will be interesting to see how incarceration affects future income for these different race groups. Let us run a t-test to check whether there is any difference.

t.test(nlsy$income,nlsy$TotalIncarcerations,na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  nlsy$income and nlsy$TotalIncarcerations
## t = 110.08, df = 4969, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44138.31 45738.90
## sample estimates:
##    mean of x    mean of y 
## 4.493874e+04 1.383965e-01

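The test rejects equality of means. Because it compares mean income with the mean number of incarcerations, which are on different scales, this does not by itself establish that the number of incarcerations affects a respondent’s income; the regression models below estimate that effect directly.

Now that we know this, let us look at the rest of our continuous variables and check the correlations between them and the variables above.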

economic.var.names <- c("ParentNetworth", "College30", "Parent20", "Jail20", "HSD20","TotalIncarcerations")

ggpairs(nlsy[,c(economic.var.names, "race")], axisLabels = "internal")

Just for this purpose, we used “College30”, “Parent20”, “Jail20”, and “HSD20” as continuous variables rather than categorical for better graphical representation.

There does not seem to be any strong relationship between the continuous variables, so we can keep all of them for our linear regression model. Let us put the correlations in a table to assess them.

kable(round(cor(nlsy[,economic.var.names], use='complete.obs'), 3))

|  | ParentNetworth | College30 | Parent20 | Jail20 | HSD20 | TotalIncarcerations |
|:---|---:|---:|---:|---:|---:|---:|
| ParentNetworth | 1.000 | 0.207 | -0.164 | -0.075 | 0.138 | -0.070 |
| College30 | 0.207 | 1.000 | -0.231 | -0.197 | 0.338 | -0.153 |
| Parent20 | -0.164 | -0.231 | 1.000 | 0.341 | -0.207 | 0.120 |
| Jail20 | -0.075 | -0.197 | 0.341 | 1.000 | -0.171 | 0.139 |
| HSD20 | 0.138 | 0.338 | -0.207 | -0.171 | 1.000 | -0.092 |
| TotalIncarcerations | -0.070 | -0.153 | 0.120 | 0.139 | -0.092 | 1.000 |

‘Jail20’ and ‘Parent20’ are positively correlated with each other and with ‘TotalIncarcerations’, and negatively correlated with ‘ParentNetworth’, ‘College30’, and ‘HSD20’. All of the correlations are weak to moderate; the largest in magnitude is 0.341.

Now let us create regression models to see whether these variables have any significant effect on income across races.

lm.base <- lm(income ~ race + sex, data = nlsy)
kable(round(summary(lm.base)$coefficients, 3), format='markdown')

|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:---|---:|---:|---:|---:|
| (Intercept) | 43638.528 | 680.767 | 64.102 | 0.000 |
| raceBlack | -12685.343 | 953.932 | -13.298 | 0.000 |
| raceHispanic | -6168.497 | 1011.340 | -6.099 | 0.000 |
| raceMixed Race(Non-Hispanic) | -367.628 | 4071.201 | -0.090 | 0.928 |
| sexMale | 11326.188 | 786.120 | 14.408 | 0.000 |

This initial model shows a statistically significant income difference between Black respondents and the White (Non-Black/Non-Hispanic) baseline, holding sex constant. However, this does not narrow down the effect enough to answer our research question. To further isolate the effect, we add the first set of variables to the regression model.

lm.household <- lm(income ~ race + sex + ParentNetworth +TotalIncarcerations + HGC_BIO_MOM1 + HGC_BIO_DAD1, data = nlsy)
kable(round(summary(lm.household)$coefficients, 3), format='markdown')

|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:---|---:|---:|---:|---:|
| (Intercept) | 20509.089 | 2953.153 | 6.945 | 0.000 |
| raceBlack | -7717.261 | 1338.588 | -5.765 | 0.000 |
| raceHispanic | 1448.719 | 1470.252 | 0.985 | 0.325 |
| raceMixed Race(Non-Hispanic) | -1714.457 | 5196.013 | -0.330 | 0.741 |
| sexMale | 12591.937 | 1014.849 | 12.408 | 0.000 |
| ParentNetworth | 0.030 | 0.004 | 7.481 | 0.000 |
| TotalIncarcerations | -8244.739 | 1026.835 | -8.029 | 0.000 |
| HGC_BIO_MOM1 | 688.393 | 234.604 | 2.934 | 0.003 |
| HGC_BIO_DAD1 | 742.158 | 212.917 | 3.486 | 0.000 |

However, the terms ‘race’ and ‘TotalIncarcerations’ deserve further attention. In the United States, the effect of incarceration differs depending on ethnic group. Previous research has shown that Blacks have a higher recidivism rate than other ethnic groups. Therefore, the effect of ‘TotalIncarcerations’ could depend on the value of ‘race’ in our regression model. An interaction term is added to the regression to test this.

lm.household2 <- lm(income ~ race + sex + ParentNetworth +TotalIncarcerations + HGC_BIO_MOM1 + HGC_BIO_DAD1+ race*TotalIncarcerations, data = nlsy)
kable(round(summary(lm.household2)$coefficients, 3), format='markdown')

|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:---|---:|---:|---:|---:|
| (Intercept) | 20532.135 | 2959.504 | 6.938 | 0.000 |
| raceBlack | -7791.990 | 1374.712 | -5.668 | 0.000 |
| raceHispanic | 1480.509 | 1509.161 | 0.981 | 0.327 |
| raceMixed Race(Non-Hispanic) | -658.813 | 5399.680 | -0.122 | 0.903 |
| sexMale | 12583.909 | 1015.429 | 12.393 | 0.000 |
| ParentNetworth | 0.030 | 0.004 | 7.460 | 0.000 |
| TotalIncarcerations | -8283.863 | 1404.347 | -5.899 | 0.000 |
| HGC_BIO_MOM1 | 687.137 | 234.782 | 2.927 | 0.003 |
| HGC_BIO_DAD1 | 743.005 | 213.189 | 3.485 | 0.000 |
| raceBlack:TotalIncarcerations | 509.580 | 2393.584 | 0.213 | 0.831 |
| raceHispanic:TotalIncarcerations | -304.501 | 2666.835 | -0.114 | 0.909 |
| raceMixed Race(Non-Hispanic):TotalIncarcerations | -14314.142 | 19719.706 | -0.726 | 0.468 |

The interaction estimates are small relative to their standard errors, and none is significant at any conventional level, so the model gives no evidence that the effect of ‘TotalIncarcerations’ differs by race. The interaction terms can therefore be left out of the model.
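
Rather than reading each interaction coefficient separately, the interaction can also be assessed as a whole with a nested-model F-test. The following sketch uses simulated stand-ins for the report's variables (the group labels, effect sizes, and names are assumptions) and is offered as one possible check, not as the report's own procedure:

```r
set.seed(7)
n <- 600
# Simulated stand-ins for the report's variables (assumption, not NLSY data).
d <- data.frame(
  race = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  incarcerations = rpois(n, 0.3)
)
# Income is generated with the same incarceration effect in every group.
d$income <- 45000 - 8000 * d$incarcerations + rnorm(n, sd = 25000)

additive         <- lm(income ~ race + incarcerations, data = d)
with_interaction <- lm(income ~ race * incarcerations, data = d)

# Joint F-test of all race:incarcerations terms at once.
joint <- anova(additive, with_interaction)
joint
```

A single joint p-value avoids picking through several individual t-tests, some of which rest on very few observations.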

By including the household set of variables, the coefficient on Black has changed. In the base model, Black respondents earned around $12,685 less than their White (Non-Black/Non-Hispanic) counterparts. In the household model, they earn around $7,717 less, holding all other attributes constant. Since the regression model now captures more independent variables, the estimated effect of Black shrinks: the effects of these other variables were being absorbed by the Black coefficient before their inclusion. Now ‘sexMale’ and ‘TotalIncarcerations’ have larger coefficients in magnitude than Black.

Now we can expand our analysis to confidence variables (FeelSafe, HSD20, Jail20, Parent20, College30).

summary(nlsy$FeelSafe)
##    Strongly agree             Agree          Disagree Strongly Disagree 
##              1602              2679               529               150 
##         (Missing) 
##                10
summary(nlsy$HSD20)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   100.0   100.0    94.7   100.0   100.0    3079
summary(nlsy$Jail20)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   5.095   3.000 100.000    3078
summary(nlsy$Parent20)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    5.00   17.16   25.00  100.00    3092
summary(nlsy$College30)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   50.00   90.00   74.79  100.00  100.00    3082

These variables required some adjustment and recoding, because responses were stored as negative values when respondents did not answer or were not able to answer the question. By resetting those values to NA, we can continue with analysis of the variables.

The confidence variables show some level of correlation with income while avoiding major collinearity between each other. This means they are candidates to be included into the regression model.

lm.confidence <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe, data = nlsy1, na.action=na.exclude)
kable(round(summary(lm.confidence)$coefficients, 3), format='markdown')

|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:---|---:|---:|---:|---:|
| (Intercept) | 23329.590 | 3076.327 | 7.584 | 0.000 |
| raceBlack | -6865.316 | 1354.162 | -5.070 | 0.000 |
| raceHispanic | 1584.384 | 1468.403 | 1.079 | 0.281 |
| raceMixed Race(Non-Hispanic) | -1089.314 | 5189.113 | -0.210 | 0.834 |
| sexMale | 12602.617 | 1013.713 | 12.432 | 0.000 |
| TotalIncarcerations | -8067.226 | 1029.556 | -7.836 | 0.000 |
| ParentNetworth | 0.029 | 0.004 | 7.212 | 0.000 |
| HGC_BIO_DAD1 | 668.107 | 213.485 | 3.130 | 0.002 |
| HGC_BIO_MOM1 | 687.997 | 234.286 | 2.937 | 0.003 |
| FeelSafeAgree | -2333.237 | 1114.878 | -2.093 | 0.036 |
| FeelSafeDisagree | -6051.133 | 1828.914 | -3.309 | 0.001 |
| FeelSafeStrongly Disagree | -6384.526 | 3301.680 | -1.934 | 0.053 |

‘FeelSafe’ makes a significant contribution to the model. We can show this with an ANOVA test comparing the model without ‘FeelSafe’ to the model that includes it.

test.list <- nlsy[ , c('income', 'race', 'sex','TotalIncarcerations','ParentNetworth','HGC_BIO_DAD1', 'HGC_BIO_MOM1','FeelSafe', 'Parent20','Jail20','HSD20','College30')]
nlsy1.test = na.omit(test.list)
nlsy1.test = nlsy1.test %>% filter(FeelSafe!="(Missing)")
lm.base <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1, data = nlsy1.test, na.action=na.exclude)
lm.confidence <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe, data = nlsy1.test, na.action=na.exclude)
anova(lm.base, lm.confidence)
## Analysis of Variance Table
## 
## Model 1: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1
## Model 2: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)  
## 1   1067 8.2632e+11                              
## 2   1064 8.1982e+11  3 6496178001 2.8103 0.0384 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.confidence)
## 
## Call:
## lm(formula = income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe, data = nlsy1.test, 
##     na.action = na.exclude)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61948 -19351  -3352  15291  96999 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   3.149e+04  5.216e+03   6.038 2.16e-09 ***
## raceBlack                    -5.832e+03  2.284e+03  -2.553   0.0108 *  
## raceHispanic                 -1.553e+03  2.489e+03  -0.624   0.5330    
## raceMixed Race(Non-Hispanic) -1.045e+04  8.903e+03  -1.173   0.2409    
## sexMale                       1.361e+04  1.717e+03   7.927 5.63e-15 ***
## TotalIncarcerations          -7.661e+03  1.724e+03  -4.442 9.82e-06 ***
## ParentNetworth                3.919e-02  6.562e-03   5.972 3.19e-09 ***
## HGC_BIO_DAD1                  4.863e+01  3.559e+02   0.137   0.8913    
## HGC_BIO_MOM1                  7.341e+02  3.851e+02   1.906   0.0569 .  
## FeelSafeAgree                -4.323e+03  1.907e+03  -2.266   0.0236 *  
## FeelSafeDisagree             -7.791e+03  3.117e+03  -2.500   0.0126 *  
## FeelSafeStrongly Disagree    -6.389e+03  5.352e+03  -1.194   0.2328    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27760 on 1064 degrees of freedom
## Multiple R-squared:  0.1556, Adjusted R-squared:  0.1469 
## F-statistic: 17.82 on 11 and 1064 DF,  p-value: < 2.2e-16

‘FeelSafe’ is a significant categorical variable at the 0.05 (5%) level: the ANOVA p-value is 0.0384.

Next we want to test the impact of ‘HSD20’ on our regression by doing an Analysis of Variance between the model without ‘HSD20’ and the model with it included:

lm.confidence <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe, data = nlsy1.test, na.action=na.exclude)
lm.confidence2 <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe + HSD20, data = nlsy1.test, na.action=na.exclude)
kable(round(summary(lm.confidence2)$coefficients, 3), format='markdown')

|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:---|---:|---:|---:|---:|
| (Intercept) | 17500.863 | 7328.448 | 2.388 | 0.017 |
| raceBlack | -5638.652 | 2278.676 | -2.475 | 0.013 |
| raceHispanic | -1453.728 | 2482.309 | -0.586 | 0.558 |
| raceMixed Race(Non-Hispanic) | -10097.865 | 8877.295 | -1.137 | 0.256 |
| sexMale | 13952.126 | 1716.739 | 8.127 | 0.000 |
| TotalIncarcerations | -7347.202 | 1723.171 | -4.264 | 0.000 |
| ParentNetworth | 0.039 | 0.007 | 5.886 | 0.000 |
| HGC_BIO_DAD1 | -38.452 | 356.300 | -0.108 | 0.914 |
| HGC_BIO_MOM1 | 694.981 | 384.224 | 1.809 | 0.071 |
| FeelSafeAgree | -4430.058 | 1902.239 | -2.329 | 0.020 |
| FeelSafeDisagree | -7183.881 | 3115.849 | -2.306 | 0.021 |
| FeelSafeStrongly Disagree | -4409.923 | 5385.846 | -0.819 | 0.413 |
| HSD20 | 160.605 | 59.255 | 2.710 | 0.007 |

anova(lm.confidence, lm.confidence2)
## Analysis of Variance Table
## 
## Model 1: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe
## Model 2: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20
##   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)   
## 1   1064 8.1982e+11                                 
## 2   1063 8.1420e+11  1 5626930844 7.3464 0.006828 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that reported confidence in graduating high school by age 20 has a significant association with income later in life at the 0.01 (1%) significance level (p = 0.0068). Each additional percentage point of reported confidence is associated with about $161 more income, holding the other variables constant. These results show that we should include ‘HSD20’ in our regression.

We can repeat the process for ‘College30’, ‘Jail20’, and ‘Parent20’ to test for the significance and impact of their inclusion to our final model.

lm.confidence2 <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe + HSD20, data = nlsy1.test, na.action=na.exclude)
lm.confidence3 <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe + HSD20 + College30, data = nlsy1.test, na.action=na.exclude)
anova(lm.confidence2, lm.confidence3)
## Analysis of Variance Table
## 
## Model 1: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20
## Model 2: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20 + College30
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1063 8.1420e+11                            
## 2   1062 8.1223e+11  1 1969442710 2.5751 0.1089

‘College30’ is not significant at any conventional level (p = 0.1089), so we can drop it from our final regression.

lm.confidence4 <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe + HSD20 + Jail20, data = nlsy1.test, na.action=na.exclude)
anova(lm.confidence2, lm.confidence4)
## Analysis of Variance Table
## 
## Model 1: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20
## Model 2: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20 + Jail20
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1   1063 8.1420e+11                               
## 2   1062 8.1191e+11  1 2291547867 2.9974 0.08369 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The inclusion of ‘Jail20’ is significant only at the 10% level (p = 0.0837), not at the 5% level, so we drop it from our final regression.

lm.confidence5 <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe + HSD20 + Parent20, data = nlsy1.test, na.action=na.exclude)
anova(lm.confidence2, lm.confidence5)
## Analysis of Variance Table
## 
## Model 1: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20
## Model 2: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20 + Parent20
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1063 8.1420e+11                            
## 2   1062 8.1245e+11  1 1746444078 2.2829 0.1311

The results from these tests show that none of the three variables makes a significant contribution at the 5% level compared to the model that includes just ‘HSD20’ and ‘FeelSafe’. For that reason, we leave out ‘Jail20’, ‘College30’, and ‘Parent20’.
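
The repeated add-one-variable comparisons can be written as a short loop. This is a sketch on simulated data in which only ‘HSD20’ is built to matter; the data-generating numbers and the simplified base model are assumptions made for illustration:

```r
set.seed(3)
n <- 400
# Simulated data (assumption): only HSD20 is built to affect income.
d <- data.frame(
  HSD20 = runif(n, 0, 100), College30 = runif(n, 0, 100),
  Jail20 = runif(n, 0, 100), Parent20 = runif(n, 0, 100)
)
d$income <- 20000 + 150 * d$HSD20 + rnorm(n, sd = 25000)

base_fit <- lm(income ~ HSD20, data = d)
candidates <- c("College30", "Jail20", "Parent20")

# For each candidate, add it to the base model and record the partial F-test p-value.
p_values <- sapply(candidates, function(v) {
  bigger <- update(base_fit, as.formula(paste(". ~ . +", v)))
  anova(base_fit, bigger)[["Pr(>F)"]][2]
})
round(p_values, 4)
```

Each p-value answers the same question as the individual `anova()` calls: does adding this one variable reduce the residual sum of squares by more than chance would suggest?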

Methodology

Our data selection stemmed from our initial research question: what effect do households and confidence levels at a young age have on the wage gap between Whites and Blacks later in life? In order to answer this question, we selected 11 variables for further testing and analysis.

Household Variables: (‘TotalIncarcerations’, ‘ParentNetworth’, ‘HGC_BIO_DAD’, ‘HGC_BIO_MOM’). These four variables reflect the household in which the individual was raised. Household net worth is a proxy for family income and economic status, and illuminates the effect of generational wealth on income later in life. Mother’s and father’s education is a proxy for generational educational opportunities and provides insight into the pattern of attainment the individual was raised with. Total incarcerations is the outlier in this group, given that it reflects the individual rather than the family. However, given that a large number of incarcerations is linked with a lack of familial and/or communal support, it can be seen as a proxy for support systems.

Confidence Variables: (‘FeelSafe’, ‘HSD20’, ‘Jail20’, ‘Parent20’, ‘College30’). These five variables are intended to provide insight to the individual’s state of mind at a young age. The questions cover education and personal life, and outline their expectations for their own future. However, this variable set was the most problematic to work with due to the low response rate. Overall, only two were deemed viable for inclusion into the final regression.

Missing Data: Missing data proved to be a major barrier in both variable groups. For father’s and mother’s education, significant recoding was needed to convert negative values to NA. For the confidence variables, low response rates restricted the impact of most of the variables. Since these were personal responses, substituting an alternative value for NA did not seem like a methodologically sound option. Instead, removal from the regression was the best option. For regression comparison, removing NAs was necessary so that the ANOVA tests compared models fitted to the same observations. na.omit() was used to remove NAs and achieve data set parity. By the start of our regression analysis, na.omit() left us with 1,076 observations that had values for every variable we were targeting.
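
The need for data set parity can be illustrated with a toy example: when a larger model uses a variable with extra missing values, `lm()` silently fits it on fewer rows than the smaller model, and `anova()` cannot compare the two. The data frame below is made up for the illustration:

```r
# Toy data frame with scattered missing values (assumption, not NLSY data).
d <- data.frame(
  income = c(12, 18, 33, 41, 47, 62, 69, 85),
  x1     = c(1, 2, NA, 4, 5, 6, 7, 8),
  x2     = c(2, 1, 4, NA, 6, 5, 8, 7)
)

# Fitting on the raw data: each model silently drops a different set of rows.
n_small <- nobs(lm(income ~ x1, data = d))        # uses rows where x1 is present
n_large <- nobs(lm(income ~ x1 + x2, data = d))   # also requires x2

# Restricting to complete cases first puts both models on the same rows.
complete <- na.omit(d[, c("income", "x1", "x2")])
fit_small <- lm(income ~ x1, data = complete)
fit_large <- lm(income ~ x1 + x2, data = complete)
anova(fit_small, fit_large)
```

The trade-off is sample size: every variable added to the complete-case subset can only shrink it, which is why the low-response confidence variables were so costly.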

Topcoded Variables: The only topcoded variable in our regression was the dependent variable, income. Because so few of the respondents fell into the top 2%, we did not feel it was necessary to make any major changes to the values. However, this skews our results, as the effect of high earners is muted. Since everyone above the $50,000 mark will be assigned the average, those who earned very highly will not have as large an effect on the regression as they should. Without access to the respondents to retrieve the real income level, the best option was to accept the skewed results.

Failed Trends: One trend investigated was the relationship between race and total incarcerations. Given the prevalence of recidivism research in the US, I was expecting a significant result from the interaction term. Furthermore, I was expecting more correlation between the variable groups, especially the confidence set. Those who responded that they would graduate high school by 20 confidently were not necessarily more likely to respond as confidently to, say, graduating college by 30.

Not Included: As noted above, the relationship between race and total incarcerations is not included in our findings because it was not significant, although it was investigated. Furthermore, the full extent of the relationship between income and confidence at a young age is not included because we only included 2 variables from the confidence set. While ‘FeelSafe’ and ‘HSD20’ provide some insight into the effect of confidence in one’s future on future income, they cannot capture the complete relationship. Our findings do not include a more detailed discussion of ‘FeelSafe’ even though, personally, I found this to be a very telling variable. Feeling less safe at school at a young age was associated with a significant decrease in future income, holding the other variables constant. This is notable for educators, school districts, and legislators when considering education reform.

Final Analysis: Our regression settled on the full inclusion of household variables (‘TotalIncarcerations’, ‘ParentNetworth’, ‘HGC_BIO_DAD’, ‘HGC_BIO_MOM’) and two of the confidence variables (‘FeelSafe’, ‘HSD20’). These 6, plus race and sex, compose our final regression model.

Findings

Our research set out to better understand the relationship between early childhood households and confidence levels and income levels later in life. Initially, we found evidence of race-based differences in average income later in life.

The plot above clearly shows that the average income amongst Whites is larger than the average amongst Blacks and Hispanics at a statistically significant level, while Mixed Race is too uncertain to gain any insight. Since our research question is focused on the difference between Blacks and Whites, this does not pose a problem.

However, our interpretation is not limited to Black males. Our initial data showed that Black men and Black women experienced different income effects.

by.race.sex <- nlsy %>% group_by(race, sex) %>%
    summarise('Avg_Income' = mean(income))
## `summarise()` regrouping output by 'race' (override with `.groups` argument)
kable(by.race.sex %>% arrange(desc(Avg_Income)))

| race | sex | Avg_Income |
|:---|:---|---:|
| Mixed Race(Non-Hispanic) | Male | 56240.00 |
| Non-Black/Non-Hispanic | Male | 55835.37 |
| Hispanic | Male | 49894.11 |
| Non-Black/Non-Hispanic | Female | 42658.43 |
| Mixed Race(Non-Hispanic) | Female | 41403.95 |
| Black | Male | 39082.41 |
| Hispanic | Female | 36331.48 |
| Black | Female | 33726.46 |

While Black Women report more income parity with Black Men, Black Women report the lowest average income out of any race:sex group. Black Men report the lowest average income out of any race:Male group, and even perform lower than some race:Female groups. Given these patterns of disparity between racial groups, our next step was to investigate household variables on income. The final dataset we are working with contains 1,076 observations.

Term                            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                    27553.175    5034.179    5.473     0.000
raceBlack                      -6606.471    2272.532   -2.907     0.004
raceHispanic                   -1511.685    2488.065   -0.608     0.544
raceMixed Race(Non-Hispanic)  -12632.008    8887.712   -1.421     0.156
sexMale                        13744.535    1720.795    7.987     0.000
ParentNetworth                     0.041       0.007    6.307     0.000
TotalIncarcerations            -7874.210    1720.498   -4.577     0.000
HGC_BIO_MOM1                     715.940     385.722    1.856     0.064
HGC_BIO_DAD1                     104.704     355.146    0.295     0.768

Using plot() to generate diagnostic graphs for this regression model, it is clear that our regression struggles to fit estimated values to the observed values. However, given that the longitudinal study has major outliers, the overall performance of the model is not bad. The normal Q-Q plot shows that performance is particularly weak in the tails of the data, which was expected given the effect of topcoded variables and the lack of full responses to every question used in the model. Since we removed the topcoded income outliers and cleaned our data as best we could, the fit is reasonable for the dataset we’re working with. Despite these weaknesses, it is important to note that Black race, sex, ‘ParentNetworth’, and ‘TotalIncarcerations’ are all significant at the 1% level. Mother’s education (‘HGC_BIO_MOM1’) is significant only at the 10% level, and father’s education (‘HGC_BIO_DAD1’) is not significant.
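
As a rough illustration of the diagnostics described above, the sketch below fits a small model to simulated data and draws the four standard panels with plot(). The data frame `toy`, its columns, and the skewed error term are assumptions made up for this example; they are not the report's `nlsy1.test.final` data.

```r
# Simulated stand-in data (hypothetical); not the report's dataset.
set.seed(1)
n <- 200
toy <- data.frame(
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  networth = runif(n, 0, 5e5)
)
# Right-skewed errors mimic the heavy upper tail typical of income data.
toy$income <- 20000 + 12000 * (toy$sex == "Male") + 0.04 * toy$networth +
  rexp(n, rate = 1 / 15000)

fit <- lm(income ~ sex + networth, data = toy)

# Four standard panels: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Count observations whose standardized residual exceeds 3 in absolute
# value; these are the points that drift off the Q-Q line in the tails.
n_extreme <- sum(abs(rstandard(fit)) > 3)
```

Counting large standardized residuals gives a simple numeric companion to the Q-Q panel when comparing one model against another.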

The next stage of analysis was to include the set of confidence variables. These proxies for the individual’s belief in his or her future are a subjective measure of confidence. Subjectivity in the responses is a drawback since it is not a truly objective measure of confidence, and instead relies on a myriad of factors (for example, how the respondent was feeling that particular day). However, given the unavailability of objective confidence tests, both today and at the time of the study, it is safe to say that this subjective measure is still the best measure available to us.

‘FeelSafe’ is a categorical variable. An analysis of variance test will provide insight into the significance of this categorical variable in our overall model. In order to compare the models, we must handle the missing values in the dataset. Since responses to these questions were not mandatory, the sets of complete observations are not the same for every model.
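
One common way to put nested models on the same rows is to subset to complete cases before fitting. The sketch below shows the idea on simulated data. The data frame `toy`, its columns, and the number of missing answers are hypothetical, and this is not necessarily how `nlsy1.test.final` was built.

```r
# Simulated stand-in data with optional (missing) survey answers; hypothetical.
set.seed(2)
n <- 300
toy <- data.frame(
  income = rnorm(n, 40000, 15000),
  networth = runif(n, 0, 5e5),
  FeelSafe = factor(sample(
    c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"),
    n, replace = TRUE
  ))
)
toy$FeelSafe[sample(n, 40)] <- NA  # simulate non-response on 40 rows

# Keep only rows that are complete on every variable the larger model uses.
vars <- c("income", "networth", "FeelSafe")
toy.cc <- toy[complete.cases(toy[, vars]), ]

small <- lm(income ~ networth, data = toy.cc)
large <- lm(income ~ networth + FeelSafe, data = toy.cc)

# Both fits now use identical rows, so the partial F-test compares like with like.
cmp <- anova(small, large)
```

Without the subsetting step, lm() would drop a different set of rows for each formula, and anova() stops with an error when the models were fitted to datasets of different sizes.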

lm.final2 <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1, data = nlsy1.test.final, na.action=na.exclude)
lm.final3 <- lm(income ~ race + sex + TotalIncarcerations + ParentNetworth + HGC_BIO_DAD1
                   + HGC_BIO_MOM1 + FeelSafe, data = nlsy1.test.final, na.action=na.exclude)
anova(lm.final2, lm.final3)
## Analysis of Variance Table
## 
## Model 1: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1
## Model 2: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)  
## 1   1067 8.2632e+11                              
## 2   1064 8.1982e+11  3 6496178001 2.8103 0.0384 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
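
As a check on how to read this table, the F statistic and p-value can be recomputed by hand from the printed sums of squares. The sketch below uses only numbers copied from the output above; the variable names are made up for the example.

```r
# Values copied from the ANOVA table above.
ss_added <- 6496178001  # drop in residual sum of squares from adding FeelSafe
rss_large <- 8.1982e11  # residual sum of squares of the larger model (rounded)
df_added <- 3           # extra coefficients for the four-level factor
df_resid <- 1064        # residual degrees of freedom of the larger model

# Partial F: improvement per added coefficient, relative to the
# residual variance of the larger model.
f_stat <- (ss_added / df_added) / (rss_large / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_added, df_resid, lower.tail = FALSE)
```

Because the residual sum of squares is printed to five significant figures, the recomputed values agree with the table only up to rounding. The resulting p-value falls between 0.01 and 0.05.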

The results of the ANOVA test show that ‘FeelSafe’ is a significant predictor of income at the 5% level (p = 0.0384) and does deserve inclusion in the model. We can now move on to test the remaining numeric variables in the confidence set. As stated before, the ANOVA tests showed that none of these except ‘HSD20’ had a significant impact on our regression. The test for ‘HSD20’ and our final model are as follows.

## Analysis of Variance Table
## 
## Model 1: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe
## Model 2: income ~ race + sex + TotalIncarcerations + ParentNetworth + 
##     HGC_BIO_DAD1 + HGC_BIO_MOM1 + FeelSafe + HSD20
##   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)   
## 1   1064 8.1982e+11                                 
## 2   1063 8.1420e+11  1 5626930844 7.3464 0.006828 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Term                            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                    17500.863    7328.448    2.388     0.017
raceBlack                      -5638.652    2278.676   -2.475     0.013
raceHispanic                   -1453.728    2482.309   -0.586     0.558
raceMixed Race(Non-Hispanic)  -10097.865    8877.295   -1.137     0.256
sexMale                        13952.126    1716.739    8.127     0.000
TotalIncarcerations            -7347.202    1723.171   -4.264     0.000
ParentNetworth                     0.039       0.007    5.886     0.000
HGC_BIO_DAD1                     -38.452     356.300   -0.108     0.914
HGC_BIO_MOM1                     694.981     384.224    1.809     0.071
FeelSafeAgree                  -4430.058    1902.239   -2.329     0.020
FeelSafeDisagree               -7183.881    3115.849   -2.306     0.021
FeelSafeStrongly Disagree      -4409.923    5385.846   -0.819     0.413
HSD20                            160.605      59.255    2.710     0.007

Our final regression is shown above. The inclusion of HSD20 is significant at the 1% level (p = 0.007), bringing our final regression against income to be (race + sex + networth + total incarcerations + mother education + father education + feel safe + high school degree by 20).

Again, using plot() generates diagnostic graphs for our expanded model. The inclusion of the two confidence variables does not improve the residuals vs. fitted plot, but it also does not negatively impact the error measures. Similarly, the normal Q-Q plot seems to improve slightly at the upper tail, with a few more observations falling at ~2 standardized residuals rather than ~4.

The model expansions suggest that our findings are not well suited to those in the tail ends of the distribution. Given the topcoding of income, it was expected that our model would suffer at the extremes. Our final coefficient on Black, -$5638.65, is significant at the 0.05 level (p = 0.013) and shows that Black males earn significantly less than their White counterparts. The 95% confidence interval for the coefficient, computed as the estimate plus or minus roughly 1.96 standard errors, is approximately -$10110 to -$1167. More importantly, the implied coefficient on Female, -$13952.13 (the negative of the ‘sexMale’ estimate), coupled with our coefficient on Black, -$5638.65, shows that Black females are predicted to earn about $19590.78 less than White males and $5638.65 less than White females with otherwise identical characteristics. This is particularly notable given the inclusion of our other variables pertaining to household characteristics and confidence measures. Even when all of those are the same between White and Black females, some other factor through life appears to be driving a major difference in average income.
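
For reference, a 95% confidence interval for a regression coefficient can be computed from its estimate and standard error. The sketch below does this for the ‘raceBlack’ row of the final regression table, using the residual degrees of freedom printed in the last ANOVA output. The variable names are made up for the example. With the fitted model object available, confint() would return the same kind of interval directly.

```r
# Estimate and standard error for raceBlack, copied from the final regression table.
est <- -5638.652
se <- 2278.676
df_resid <- 1063  # residual degrees of freedom of the final model

# 95% interval: estimate plus or minus the t critical value times the standard error.
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level, which matches the p-value of 0.013 reported for ‘raceBlack’.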

Discussion

Our findings provide an important insight into the broader investigation of the White-Black wage gap in the United States. While other studies focus on other factors between White and Black individuals, we believe that our report sheds more light on the lightly-researched relationship between early childhood households and developed confidence and later-in-life income.

However, our findings are not perfect in any sense. Mandatory responses to the variables used within our regression analysis would have greatly improved our findings and lowered overall variance. Moreover, the data provided from 1997 is almost 25 years old and may not be as applicable to today’s world. Furthermore, the variance present in our model suggests that further research is necessary to pin down a better estimate of the relationship in question.